Tolerating Network Failures in System Area Networks
Abstract
In this paper, we investigate how system area networks can deal with transient and permanent network failures. We design and implement a firmware-level retransmission scheme to tolerate transient failures and an on-demand network mapping scheme to deal with permanent failures. Both schemes are transparent to applications, conceptually simple, and suitable for low-level implementations, e.g., in firmware. We then examine how the retransmission scheme affects system performance and how various protocol parameters impact system behavior. We analyze and evaluate system performance using a real implementation on a state-of-the-art cluster, with both microbenchmarks and real applications from the SPLASH-2 suite.
Similar resources
Dynamic Data Replication for Tolerating Single Node Failures in Shared Virtual Memory Clusters of Workstations
In this paper we investigate how shared memory clusters can take advantage of replication to tolerate single system failures. We start from a shared virtual memory protocol (GeNIMA) that has been optimized for low-latency, high-bandwidth system area networks. We propose a set of extensions that keep shared data consistent in the presence of failures and support SMP nodes. Our scheme uses dyn...
Reliability and Performance Evaluation of Fault-aware Routing Methods for Network-on-Chip Architectures (RESEARCH NOTE)
Nowadays, faults and failures are increasing especially in complex systems such as Network-on-Chip (NoC) based Systems-on-a-Chip due to the increasing susceptibility and decreasing feature sizes. On the other hand, fault-tolerant routing algorithms have an evident effect on tolerating permanent faults and improving the reliability of a Network-on-Chip based system. This paper presents reliabili...
Deadlock-free dynamic reconfiguration over InfiniBand™ networks
InfiniBand Architecture (IBA) is a newly established general-purpose interconnect standard applicable to local area, system area and storage area networking and I/O. Networks based on this standard should be capable of tolerating topological changes due to resource failures, link/switch activations, and/or hot swapping of components. In order to maintain connectivity, the network’s routing func...
Topology management techniques for tolerating node failures in wireless sensor networks: A survey
In wireless sensor networks (WSNs) nodes often operate unattended in a collaborative manner to perform some tasks. In many applications, the network is deployed in harsh environments such as battlefield where the nodes are susceptible to damage. In addition, nodes may fail due to energy depletion and breakdown in the onboard electronics. The failure of nodes may leave some areas uncovered and d...
Tolerating VLR and HLR Failures in Two-Tier PCS Networks
One of the key issues in the design of personal communication service (PCS) systems is the efficient management of location information. The current IS-41 and GSM schemes use a two-tier system of home location register (HLR) and visitor location register (VLR) databases. In a two-tier system, the success of a call requires the HLR and the callee’s current VLR to be failure-free. A VLR failure a...
Publication date: 2002